Abstract

Convenient access to handwritten historical document collections in libraries generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Currently, extensive manual labor is used to annotate and organize such collections, because handwriting recognition approaches provide only poor results on old documents.

In this work, we present a novel retrieval approach for historical document collections, which does not require recognition. We assume that word images can be described using a vocabulary of discretized word features. From a training set of labeled word images, we extract discrete feature vectors, and estimate the joint probability distribution of features and word labels. For a given feature vector (i.e. a word image), we can then calculate conditional probabilities for all labels in the training vocabulary. Experiments show that this relevance-based language model works very well, with a mean average precision of 89% for 4-word queries on a subset of George Washington’s manuscripts. We also show that this approach may be extended to general shapes by using the same model and a similar feature set to retrieve general shapes in two different shape datasets.

1.  Introduction 

Libraries are in transition from offering strictly paper-based material to providing electronic versions of their collections. For simple access, multimedia information, such as audio, video or images, requires an index that allows one to retrieve data that is relevant to a given text query.

At this time, historical manuscripts like George Washington’s correspondence are manually transcribed in order to provide input to a text search engine. Unfortunately, the cost of this approach is prohibitive for large collections. Automatic approaches using handwriting recognition cannot be applied (see results in [20]), since the current technology for recognizing handwriting from images has only been successful in domains with very limited lexicons and/or high redundancy, such as legal amount processing on checks and automatic mail sorting. An alternative approach called word spotting [18], which performs word image clustering, is currently only computationally feasible for small collections.

Here we present an approach to retrieving handwritten historical documents from a single author, using a relevance-based language model [11, 12]. Relevance models have been successfully used for both retrieval and cross-language retrieval of text documents and more recently for image annotation [9]. In their original form, these models capture the joint statistical occurrence pattern of words in two languages, which are used to describe a certain domain (e.g. a news event).

This paradigm can be used for any signal domain, by describing images, shapes, etc. with visterms (words from a feature vocabulary), thus generating a “signal description language”. When the joint statistical occurrence patterns of visterms and the image annotation vocabulary (e.g. word image labels) are learned, one can perform tasks such as image retrieval using text queries, or automatic annotation. While our focus here is on handwritten documents, where our signals to be retrieved are images of words, we later show that our approach can be easily adapted to work with general shapes.

In this work, we model the occurrence pattern of words in two languages using the joint probability distribution over the visterm and annotation vocabulary. From a training set of annotated images of handwritten words, we learn this joint probability distribution and perform retrieval experiments with text queries on a test set. Word images are described using a vocabulary that is derived from a set of word shape features.

Our model differs from others in a number of respects. Unlike traditional handwriting recognition paradigms [13], our approach does not require perfect recognition for good retrieval. The work presented here is also related to models used for object recognition/image annotation and retrieval [6, 1, 3, 9]. However, those approaches were proposed for annotating/retrieving general-purpose photographs and primarily used color and texture as features. Here we focus on shape features for retrieval tasks, but the approach here can be extended to many shape-related retrieval and annotation tasks in computer vision.

Using this relevance-based language model, we have conducted retrieval experiments on a set of 20 pages from the George Washington collection at the Library of Congress. The mean average precision scores we achieve lie in the range from 54% to 89% for queries using 1 to 4 words (respectively). These are very good results, considering the noise in historical documents. Retrieval experiments on general shapes from the MPEG-7 and COIL-100 [16] datasets¹ yielded mean average precision of 87% and up to 97% respectively.

In the following section we discuss prior work in the field, followed by a detailed description of the relevance-based model in section 2. After briefly explaining the features used in our approach (section 3), we present line-retrieval results on the George Washington collection (section 4) and show how our retrieval approach can be extended to general shapes in section 5. Section 6 concludes the paper.

1.1.  Previous  Work 

There are a number of approaches reported in the literature which model the statistical co-occurrence patterns of image features and annotation words, in order to perform such diverse tasks as image annotation, object recognition and image retrieval. Mori et al. [15] estimate the likelihood of annotation terms appearing in a given image, by modeling the co-occurrence relationship between clustered feature vectors and annotation terms. Duygulu et al. [6] go one step further by actually annotating individual image regions (rather than producing sets of keywords for an image), which is in effect object class recognition. Barnard and Forsyth [1] extended Hofmann’s Hierarchical Aspect Model for text and proposed a multi-modal approach to hierarchical clustering of images and words using EM. Blei and Jordan [3] extended their Latent Dirichlet Allocation (LDA) Model and proposed a Correspondence LDA model, which relates words and images.

The authors of [9] introduced the model used in this work for automatic image annotation and retrieval. With the same data and feature set, the results for image annotation were dramatically better than previous models, for example twice as good as the translation model [6]. This work extends that model to a different domain (word images in a noisy document environment), uses an improved feature representation and different attributes (shape). Shape has to be described by features that are very different from the previously utilized color and texture features. We test the model on a data set with a larger annotation vocabulary than previous experiments and a feature vector discretization that preserves more detail than the clustering algorithms which are utilized in other approaches. In addition, our application (line retrieval) uses a new retrieval model formulation. Other authors have previously suggested document-retrieval systems that do not require recognition, but queries have to be issued in the form of examples in the image domain (e.g. see [19]). To our knowledge, our system is the first to allow retrieval without recognition using text queries. We also demonstrate that this approach easily extends to more general shapes using two different data collections: the MPEG-7 and COIL-100 datasets.

¹ We extracted silhouettes from the COIL-100 dataset in order to use it in our shape retrieval experiments.

All of the image-to-word translation approaches we are aware of operate on image collections of good quality (e.g. the Corel image data base [6, 9]), which usually contain color and texture information. Color is known to be one of the most useful features for describing objects. Duygulu et al. [6], for example, use half of the entries in their feature vectors for color information. Images of handwritten words, on the other hand, do not generally contain color or texture information, and in the case of historical documents, the image quality is often greatly reduced.

The lack of other features makes shape a typical choice for offline handwriting recognition approaches. We make use of holistic word shape features, which are justified by psychological studies of human reading [13] and are widely used in the field [5, 18, 21].

Our extension to general shape retrieval makes use of a very similar feature set and allows querying using ASCII text, which is in contrast to the many query-by-example retrieval approaches (see e.g. [8, 14]). The goal was not to produce the best possible shape retrieval system, but rather to demonstrate the generality of our shape retrieval model. With highly specialized shape features, such as those described in [22], it is likely that even higher precision scores could be achieved.

2.  Model  Formulation 

Before explaining our model in detail, we would like to provide some intuition for it. Previous research in cross-lingual information retrieval has shown that co-occurrence probabilities of words in two languages (e.g. English and Chinese) can be effectively estimated from a parallel corpus, that is, a collection of document pairs, where each document is available in two languages. Reliable estimates can be achieved even without any knowledge of the involved languages. One approach to this problem assumes that the joint distributions of, say, English and Chinese words are determined from a training set and may subsequently be used to compute the probability of occurrence of the term e in an English document given the occurrence of the terms q in a Chinese document [23].

By analogy, word images may be described using two different vocabularies: an image description language (visterms) and the textual (ASCII) representation of the word. To obtain visterms, we extract features from the images and discretize them, giving us a discrete vocabulary for each word image. From a set of labeled images of words we can then estimate the joint probability $P(w, f_1 \ldots f_k)$, where $w$ is a word label (the word “transcription”) and the $f_i$ are words from the image description language. Using the conditional density $P(w \mid f_1 \ldots f_k)$ we can perform retrieval of handwritten text without recognition with high accuracy.


2.1. Model Estimation

Suppose we have a collection $C$ of annotated manuscripts. We will model this collection as a sequence of random variables $W_i$, one for each word position $i$ in $C$. Each variable $W_i$ takes on a dual representation: $W_i = \{h_i, w_i\}$, where $h_i$ is the image of the handwritten form at position $i$ in the collection and $w_i$ is the corresponding transcription of the word. As we describe in the following section, we will represent the surface form $h_i$ as a set of discrete features $f_{i,1} \ldots f_{i,k}$ from some feature “vocabulary” $H$. The transcription $w_i$ is simply a word from the English vocabulary $V$. Consequently, each random variable $W_i$ takes values of the form $\{w_i, f_{i,1} \ldots f_{i,k}\}$. In the remaining portions of this section we will discuss how we can estimate a probability distribution over the variables $W_i$.

We assume that for each position $i$ (i.e. image $I_i$) in the collection there exists an underlying multinomial probability distribution $P(\cdot \mid I_i)$ over the union of the vocabularies $V$ and $H$. Intuitively, our model can be thought of as an urn containing all the possible features that can appear in a representation of the word image $I_i$ as well as all the words associated with that word image. We assume that an observed feature representation $f_1 \ldots f_k$ is the result of $k$ random samples from this model. It follows from the urn model that the probabilities of observing $w, f_1 \ldots f_k$ are mutually independent once we pick a word image $I_i$ with representation $W_i$. We further assume that actual observed values $\{w, f_1 \ldots f_k\}$ represent an i.i.d. random sample drawn from $P(\cdot \mid I_i)$. Then, the probability of a particular observation is given by:

$$P(W_i = w, f_1 \ldots f_k \mid I_i) = P(w \mid I_i) \prod_{j=1}^{k} P(f_j \mid I_i) \tag{1}$$

Now suppose we are given an arbitrary observation $W = \{w, f_1 \ldots f_k\}$, and would like to compute the probability of that observation appearing as a random sample somewhere in our corpus $C$. Because the observation is not tied to any position, we have to estimate the probability as the expectation over every position $i$ in our entire collection $C$:

$$P(w, f_1 \ldots f_k) = E_i\left[P(W_i = w, f_1 \ldots f_k \mid I_i)\right] = \frac{1}{|C|} \sum_{i=1}^{|C|} P(w \mid I_i) \prod_{j=1}^{k} P(f_j \mid I_i) \tag{2}$$

Here $|C|$ denotes the aggregate number of word positions in the collection. Equation (2) gives us a powerful formalism for performing automatic annotation and retrieval over handwritten documents.

2.2.  Automatic  Annotation  and  Retrieval  of 
Manuscripts 

Suppose we are given a training collection $C$ of annotated manuscripts, and a target collection $T$ where no annotations are provided. Given an arbitrary handwritten image $h$ we can automatically compute its image vocabulary (feature) representation $f_1 \ldots f_k$ and then use equation (2) to predict the words $w$ which are likely to occur jointly with the features of $h$. These predictions would take the form of a conditional probability:

$$P(w \mid f_1 \ldots f_k) = \frac{P(w, f_1 \ldots f_k)}{\sum_{v \in V} P(v, f_1 \ldots f_k)} \tag{3}$$

This probability could be used directly to annotate new handwritten images with highly probable words. We provide a brief evaluation for this kind of annotation in section 4.2. However, if we are interested in retrieving sections of manuscripts we can make another use of equation (3).


Suppose we are given a user query $Q = q_1 \ldots q_m$. We would like to retrieve sections $S \subset T$ of the target collection that contain the query words. More generally, we would like to rank the sections $S$ by the probability that they are relevant to $Q$. One of the most effective methods for ranked retrieval is based on the statistical language modeling framework [17]. In this framework, sections $S$ of text are ranked by the probability that the query $Q$ would be observed during i.i.d. random sampling of words from $S$:

$$P(Q \mid S) = \prod_{j=1}^{m} P(q_j \mid S) \tag{4}$$

In text retrieval, estimating the probability $P(q_j \mid S)$ is straightforward: we just count how many times the word $q_j$ actually occurred in $S$, and then normalize and smooth the counts. When we are dealing with handwritten documents we do not know what words did or did not occur in a given section of text. However, we can use the conditional estimate provided by equation (3):

$$P(q_j \mid S) = \frac{1}{|S|} \sum_{o=1}^{|S|} P(q_j \mid f_{o,1} \ldots f_{o,k}) \tag{5}$$

Here $|S|$ refers to the number of word images in $S$, the index $o$ goes over all positions in $S$, and $f_{o,1} \ldots f_{o,k}$ represent a set of features derived from the word image in position $o$. Combining equations (4) and (5) provides us with a complete system for handwriting retrieval.
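
The ranking defined by equations (4) and (5) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in our experiments: the function names (`query_likelihood`, `rank_sections`) and the data layout (one dictionary of conditional probabilities per word image, assumed to have been computed beforehand with equation (3)) are hypothetical.

```python
from math import prod


def query_likelihood(query, section_posteriors):
    """Equations (4) and (5): P(Q | S) for one section S.

    section_posteriors: one dict per word image in S (hypothetical layout),
    mapping a word label to its conditional probability from equation (3).
    """
    size = len(section_posteriors)

    def p_term(q):
        # Equation (5): average the conditional estimate over all positions in S.
        return sum(post.get(q, 0.0) for post in section_posteriors) / size

    # Equation (4): query terms are treated as i.i.d. samples from S.
    return prod(p_term(q) for q in query)


def rank_sections(query, sections):
    """Return section names sorted by decreasing P(Q | S)."""
    scores = {name: query_likelihood(query, posts) for name, posts in sections.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy example with made-up probabilities for two lines of two word images each.
sections = {
    "line1": [{"fort": 0.8, "men": 0.2}, {"cumberland": 0.7, "men": 0.3}],
    "line2": [{"men": 0.9, "fort": 0.1}, {"men": 0.6, "regiment": 0.4}],
}
ranking = rank_sections(["fort", "cumberland"], sections)
```

In this toy example a label missing from a dictionary is treated as having probability zero; with the smoothed estimates of section 2.3 every label in the vocabulary would receive a non-zero value.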

2.3.  Estimation  Details 

In this section we provide the estimation details necessary for a successful implementation of our model. In order to use equation (2) we need estimates for the multinomial models $P(\cdot \mid I_i)$ that underlie every position $i$ in the training collection $C$. We estimate these probabilities via smoothed relative frequencies:

$$P(x \mid I_i) = \frac{\lambda}{1+k}\, \delta(x \in \{w_i, f_{i,1} \ldots f_{i,k}\}) + \frac{1-\lambda}{(1+k)\,|C|} \sum_{i'=1}^{|C|} \delta(x \in \{w_{i'}, f_{i',1} \ldots f_{i',k}\}) \tag{6}$$

where $\delta(x \in \{w, f_1 \ldots f_k\})$ is a set membership function, equal to one if and only if $x$ is either $w$ or one of the feature vocabulary terms $f_1 \ldots f_k$. Parameter $\lambda$ controls the degree of smoothing on the frequency estimate and can be tuned empirically.
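
To make the estimation chain concrete, the following Python sketch strings together equations (6), (1), (2) and (3). It is a minimal illustration rather than our actual implementation, and it rests on several assumptions: equation (6) is read as a linear interpolation (weight λ) between the position-level and the collection-level relative frequencies, every position has the same number k of visterms, word labels and visterm names do not collide, and all function names and the toy data are hypothetical.

```python
from collections import Counter
from math import prod


def smoothed_model(training, lam):
    """Return p(x, i), a smoothed estimate of P(x | I_i) in the spirit of equation (6).

    training: list of (word, visterms) pairs, one per word position i.
    lam: smoothing parameter lambda, to be tuned empirically.
    """
    size = len(training)
    k = len(training[0][1])  # assumes every position has exactly k visterms
    # For every term, count the positions whose representation contains it.
    background = Counter()
    for word, visterms in training:
        background.update({word} | set(visterms))

    def p(x, i):
        word, visterms = training[i]
        member = 1.0 if x == word or x in visterms else 0.0
        return (lam / (1 + k)) * member + (1 - lam) / ((1 + k) * size) * background[x]

    return p


def joint(p, size, word, visterms):
    """Equation (2): average of equation (1) over all training positions."""
    return sum(p(word, i) * prod(p(f, i) for f in visterms) for i in range(size)) / size


def annotate(p, size, vocabulary, visterms):
    """Equation (3): conditional distribution over word labels given the visterms."""
    scores = {v: joint(p, size, v, visterms) for v in vocabulary}
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}


# Toy training collection with three positions and k = 2 visterms each.
training = [("the", ("a1", "b1")), ("of", ("a2", "b2")), ("the", ("a1", "b2"))]
p = smoothed_model(training, lam=0.9)
posterior = annotate(p, len(training), {"the", "of"}, ("a1", "b1"))
```

On this toy collection the label “the” receives the larger conditional probability, since the queried visterms match the first training position exactly.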

3.  Features  and  Discretization 

The word shape features we use in this work are described in [10] (the feature section of that article was submitted to the conference review system as an anonymized supplemental file). They are holistic word shape features, ranging from word image width/height to low-order discrete Fourier transform (DFT) coefficients of word shape profiles (see Figure 1). This feature set allows us to represent each image of a handwritten word with a continuous-space feature vector of constant length.
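
As a rough illustration of a profile feature and its low-order DFT coefficients, consider the Python sketch below. The exact feature definitions are those of [10]; everything here is an assumption made for illustration: the image is a nested list with 1 for ink, the upper profile is taken as the distance from the top edge to the first ink pixel in each column, empty columns are assigned the image height, and the function names are hypothetical.

```python
import cmath


def upper_profile(image):
    """Distance from the top of the image to the first ink pixel, per column.

    image: list of rows, each a list of 0/1 values with 1 = ink (assumed layout).
    Columns without ink get the image height (an assumption of this sketch).
    """
    height = len(image)
    width = len(image[0])
    profile = []
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        profile.append(column.index(1) if 1 in column else height)
    return profile


def low_order_dft(signal, n_coeffs):
    """First n_coeffs DFT coefficients of a profile, normalized by its length."""
    n = len(signal)
    return [
        sum(s * cmath.exp(-2j * cmath.pi * k * t / n) for t, s in enumerate(signal)) / n
        for k in range(n_coeffs)
    ]


image = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
profile = upper_profile(image)
coefficients = low_order_dft(profile, 2)
```

Keeping only a few low-order coefficients yields a description of fixed length regardless of the width of the word image, which is what a constant-length feature vector requires.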

With these feature sets we get a 26-dimensional vector for word shapes. These representations are in continuous space, but the relevance model requires us to represent all feature vectors in terms of a visterm vocabulary of fixed size. Previous approaches [9] use clustering of feature vectors, where each cluster corresponds to one visterm. However, this approach is rather aggressive, since it considers words or shapes to be equal if they fall into the same cluster.


(a)  Cleaned  and  normalized  word  image, 


(b)  resulting  upper  and  lower  profile  features  displayed  together. 


Figure  1:  Two  of  the  three  shape  profile  features. 

We chose a discretization method that preserves a greater level of detail, by separately binning each dimension of a feature vector. Whenever a feature value falls into a particular bin, an associated visterm is added to the discrete-space representation of the word or shape. We used two overlapping binning schemes: the first divides each feature dimension into 10 bins, while the second creates an additional 9 bins shifted by half a bin size. The additional bins ensure that similar feature values share at least one visterm. After discretization, we have 52 visterms per word image (two per feature dimension). The entire visterm vocabulary contains 26 · 19 = 494 entries.
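
The two overlapping binning schemes can be sketched as follows. This is a hedged illustration, not our exact procedure: the per-dimension value ranges (`lows`, `highs`), the clamping of values that fall outside the shifted bins, and the integer numbering of visterms are all assumptions introduced for this example.

```python
from math import floor

N_BINS = 10                          # primary bins per feature dimension
N_SHIFTED = N_BINS - 1               # additional bins, shifted by half a bin
TERMS_PER_DIM = N_BINS + N_SHIFTED   # 19 visterms per dimension


def discretize(vector, lows, highs):
    """Map a continuous feature vector to integer visterm ids (two per dimension)."""
    visterms = []
    for dim, (value, lo, hi) in enumerate(zip(vector, lows, highs)):
        # Position in units of one bin width, clamped to [0, N_BINS).
        pos = (value - lo) / (hi - lo) * N_BINS
        pos = min(max(pos, 0.0), N_BINS - 1e-9)
        primary = int(pos)
        # Shifted bins cover [0.5, 9.5); values outside are clamped (assumption).
        shifted = min(max(floor(pos - 0.5), 0), N_SHIFTED - 1)
        base = dim * TERMS_PER_DIM
        visterms.append(base + primary)
        visterms.append(base + N_BINS + shifted)
    return visterms


# Two similar values on either side of a primary bin boundary.
a = discretize([0.29], [0.0], [1.0])
b = discretize([0.31], [0.0], [1.0])
word_terms = discretize([0.5] * 26, [0.0] * 26, [1.0] * 26)
```

The values 0.29 and 0.31 land in different primary bins but in the same shifted bin, so their representations still share one visterm; this is the purpose of the second scheme.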

4.  Handwriting  Retrieval  Experiments 

We will discuss two types of evaluation. First, we briefly look at the predictive capability of the annotation as outlined in section 2. We train a model on a small set of annotated manuscripts and evaluate how well the model is able to annotate each word in a held-out portion of the dataset. Then we turn to evaluating the model in the context of ranked retrieval.

The data set we used in training and evaluating our approach consists of 20 manually annotated pages from George Washington’s handwritten letters. Segmenting this collection yielded a total of 4773 images, the majority of which contain exactly one word. An estimated 5-10% of the images contain segmentation errors of varying degrees: parts of words that have faded tend to get missed by the segmentation, and occasionally images contain 2 or more words or only a word fragment.

4.1.  Evaluation  Methodology 

Our dataset comprises 4773 total word occurrences arranged on 657 lines. Because of the relatively small size of the dataset, all of our experiments use a 10-fold randomized cross-validation, where each time the data is split into a 90% training set and a 10% testing set. Splitting was performed on a line level, since we chose lines to be our retrieval unit.
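
A line-level split of this kind could look like the sketch below. It assumes that each of the 10 repeats is an independent random 90%/10% partition of the line indices; a partition into 10 disjoint folds would be an equally plausible reading. The function name and the fixed seed are hypothetical.

```python
import random


def line_level_splits(n_lines, n_repeats=10, test_fraction=0.1, seed=0):
    """Yield (train_lines, test_lines) index lists, split at the line level."""
    rng = random.Random(seed)
    n_test = round(n_lines * test_fraction)
    for _ in range(n_repeats):
        lines = list(range(n_lines))
        rng.shuffle(lines)
        # All word images of a line go to the same side of the split.
        yield sorted(lines[n_test:]), sorted(lines[:n_test])


splits = list(line_level_splits(657))
train_lines, test_lines = splits[0]
```

Splitting by line rather than by word image keeps every retrieval unit entirely inside either the training or the testing set.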

Figure 2: Performance on annotating word images with words (plot title: “Quality of Automatic Annotation”; x-axis: recall).

Prior to any experiments, the manual annotations were reduced to the root form using the Krovetz morphological analyzer. This is a standard practice in information retrieval; it allows one to search for semantically similar variants of the same word. For our annotation experiments we use every vocabulary word of the 4773-word dataset that occurs in both the training and the testing set. For retrieval experiments, we remove all function words, such as “of”, “the”, “and”, etc. Furthermore, to simulate real queries users might pose to our system, we tested all possible combinations of 2, 3 and 4 words that occurred on the same line in the testing set, but not necessarily in the training set. Function words were excluded from all of these combinations.
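
The construction of multi-word queries described above can be sketched with `itertools.combinations`. The stop list shown is only a three-word placeholder, and the function name and the representation of lines as lists of (already stemmed) words are assumptions of this example.

```python
from itertools import combinations

# Placeholder stop list; the actual list of function words is not given here.
FUNCTION_WORDS = {"of", "the", "and"}


def queries_from_lines(lines, size):
    """All distinct combinations of `size` content words that share a line."""
    queries = set()
    for words in lines:
        content = sorted(set(words) - FUNCTION_WORDS)
        queries.update(combinations(content, size))
    return queries


test_lines = [
    ["the", "men", "of", "virginia", "regiment"],
    ["fort", "cumberland"],
]
pairs = queries_from_lines(test_lines, 2)
triples = queries_from_lines(test_lines, 3)
```

Sorting the content words before combining them means that the same set of words found on two different lines yields a single query rather than two.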

We use the standard evaluation methodology of information retrieval. In response to a given query, our model produces a ranking of all lines in the testing set. Out of these lines we consider only the ones that contain all query words to be relevant. The remaining lines are assumed to be non-relevant. Then for each line in the ranked list we compute recall and precision. Recall is defined as the number of relevant lines above (and including) the current line, divided by the total number of relevant lines for the current query. Similarly, precision is defined as the number of relevant lines above (and including) the current line, divided by the rank of the current line. Recall is a measure of what percent of relevant lines we found, and precision suggests how many non-relevant lines we had to look at to achieve that recall. In our evaluation we use plots of precision vs. recall, averaged over all queries and all cross-validation repeats. We also report Mean Average Precision, which is an average of precision values at all recall points.
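
The definitions above translate directly into code. The sketch below assumes the common non-interpolated form of average precision, in which precision is averaged over the ranks at which a relevant line is retrieved; the function names are hypothetical.

```python
def precision_recall(ranked, relevant):
    """(recall, precision) at every rank of the list, as defined above."""
    points = []
    hits = 0
    for rank, line in enumerate(ranked, start=1):
        if line in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant lines."""
    hits = 0
    total = 0.0
    for rank, line in enumerate(ranked, start=1):
        if line in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of the per-query average precision; runs = [(ranked, relevant), ...]."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
points = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
```

For this ranking the relevant lines appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.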

4.2.  Discussion  of  Results 

Figure 2 shows the performance of our model on the task of assigning word labels to handwritten images. We carried out two types of evaluation. In position-level evaluation, we generated a probability distribution $P(w \mid f_{i,1} \ldots f_{i,k})$ for every position $i$ in the testing set. Then we looked for the rank of the correct word $w$ in that distribution and averaged the resulting recall and precision over all positions. Since we did not exclude function words at this stage, position-level evaluation is strongly biased toward very common words such as “of”, “the” etc. These words are generally not very interesting, so we also carried out a word-level evaluation. Here for a given word $w$ we look at the ranked list of all the positions $i$ in the testing set, sorted in the decreasing order of $P(w \mid f_{i,1} \ldots f_{i,k})$. This is similar to running $w$ as a query and retrieving all positions in which it could possibly occur. Recall and precision were calculated as discussed in the previous section.

Figure 3: Performance on ranked retrieval with different query sizes (x-axis: recall).

From the graphs in Figure 2 we observe that our model performs quite well in annotation. For position-level annotation, we achieve 50% precision at rank 1, which means that for a given position $i$, half the time the word $w$ with the highest conditional probability $P(w \mid f_{i,1} \ldots f_{i,k})$ is the correct one. Word-oriented evaluation also has close to 50% precision at rank 1, meaning that for a given word $w$ the highest-ranked position $i$ contains that word almost half the time. Mean Average Precision values are 54% and 52% for position-oriented and word-oriented evaluations respectively.

Now we turn our attention to using our model for the task of retrieving relevant portions of manuscripts. As discussed before, we created four sets of queries: 1, 2, 3 and 4 words in length, and tested them on retrieving line segments. Our experiments involve a total of 1950 single-word queries, 1939 word pairs, 1870 3-word and 1558 4-word queries over 657 lines. Figure 3 shows the recall-precision graphs. It is very encouraging to see that our model performs extremely well in this evaluation, reaching over 90% mean precision at rank 1. This is an exceptionally good result, showing that our model is nearly flawless when even such short queries are used. Mean average precision values were 54%, 63%, 78% and 89% for 1-, 2-, 3- and 4-word queries respectively. Figures 4 and 5 show two retrieval results with variable-length queries. We have implemented a demo web-interface for our retrieval system, which can be found at <URL omitted for review process>.

5.  General  Shapes 

We performed probabilistic annotation and retrieval experiments on the MPEG-7 shape and COIL-100 datasets to demonstrate the extensibility of our model and features to general shapes.

The feature set for the retrieval of general shapes was adapted by removing the estimate for the number of descenders (word-specific) and the image height and width features (redundant after shape normalization). In order to get more accurate representations of the shapes, the projection profile and upper/lower profiles were complemented by also calculating them for the shape at a 90 degree rotation angle. With these feature sets we get a 44-dimensional vector for general shapes as compared to the 26-dimensional vector for word shapes. Discretization as before gives 88 visterms per shape with a total visterm vocabulary size of 44 · 19 = 836.

5.1.  MPEG-7  Dataset 

The MPEG-7 dataset (see Figure 6) consists of 1400 shape images of 70 shape categories, with 20 examples per category (e.g. “apple”). To prepare the shapes for the feature extraction, we performed a closing operation on each shape, rotated it so that its principal axis is oriented horizontally and normalized its size. After the feature vectors were extracted and discretized into visterms (see section 3), we performed retrieval experiments using 10-fold cross-validation. For the retrieval experiments, we ran 70 ASCII queries on the testing set. Each of the 70 unique shape category labels serves as a query.
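
The rotation step requires an estimate of the principal axis of each shape. One common way to obtain it, assumed here purely for illustration, uses the second-order central moments of the foreground pixel coordinates; the function name and the coordinate-list input format are hypothetical.

```python
from math import atan2, degrees


def principal_axis_angle(pixels):
    """Angle (in degrees) between the principal axis and the horizontal.

    pixels: list of (x, y) coordinates of foreground pixels (assumed format).
    Uses theta = 0.5 * atan2(2 * mu11, mu20 - mu02) on central moments.
    The sign convention follows the coordinate system of the input.
    """
    n = len(pixels)
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    mu20 = sum((x - mean_x) ** 2 for x, _ in pixels)
    mu02 = sum((y - mean_y) ** 2 for _, y in pixels)
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in pixels)
    return degrees(0.5 * atan2(2 * mu11, mu20 - mu02))


horizontal_bar = [(x, 0) for x in range(10)]
diagonal_bar = [(i, i) for i in range(10)]
vertical_bar = [(0, y) for y in range(10)]
angles = [principal_axis_angle(s) for s in (horizontal_bar, diagonal_bar, vertical_bar)]
```

Rotating a shape by the negative of this angle would align its principal axis with the horizontal; the rotation itself and the closing operation are omitted from this sketch.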


(a)  bird,  (b)  lizzard,  (c)  Misk. 


Figure 6: MPEG-7 shape examples with annotations (from file names).


For each cross-validation run we have a 90% training/10% testing split of the entire dataset. We performed retrieval experiments on the training portion in order to determine the smoothing parameters λ for the visterm and annotation vocabularies. The smoothing parameters that yielded the best retrieval performance are then used for retrieval on the testing split.


Mean average precision    Standard deviation
87.24%                    4.24%

Table 1: Mean average precision results for the retrieval experiments on the MPEG-7 shape dataset, averaged over 10 cross-validation runs, with standard deviation.

Table 1 shows the mean average precision results we achieved with the 10 cross-validation runs. Even with this very simple extension of our word features and the same model we can get very high retrieval performance at 87% mean average precision. It is important to note that in contrast to the common query-by-content retrieval systems, which require some sort of shape drawing as a query, we have actually learned each shape category concept, and can retrieve similar shapes with an ASCII query.

5.2.  COIL- 100  Dataset 

In the MPEG-7 dataset, each shape is usually seen from the side. For increased complexity we turned to the COIL-100 dataset [16]. This dataset contains 7200 color images of 100 household objects and toys. Each object was placed on a turntable and an image was taken for every 5 degrees of rotation, resulting in 72 images per object. We converted the color images into shapes by binarizing the images (see Figure 7 for examples) and normalizing their sizes. In order to facilitate retrieval using text queries, each object was labeled with one of 45 class labels (these are also used as queries).

After extracting features and turning them into visterms, we performed retrieval experiments with varying numbers of training examples per object category. The numbers of examples per object (evenly spaced throughout 360 degrees of rotation) are 1, 2, 4, 8, 18, and 36. Once the training examples are selected, we pick 9 shapes per object at random from the remaining shapes. This set, which contains a total of 9 · 100 = 900 shapes, is used to train the smoothing parameters of the retrieval model. From the remaining shapes, another 9 shapes per object are selected at random to form the testing set on which we determine the retrieval performance.
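
The per-object selection of training, parameter-tuning and testing views can be sketched as follows. The view indices (0 to 71, one per 5 degrees of rotation), the choice of view 0 as the first training example, the function name and the fixed seed are assumptions of this illustration.

```python
import random


def coil_split(n_train, n_views=72, n_heldout=9, seed=0):
    """Split the views of one object into training, tuning and testing sets.

    Training views are evenly spaced over the full rotation; the tuning and
    testing views are drawn at random from the remaining views.
    """
    step = n_views // n_train
    train = list(range(0, n_views, step))[:n_train]
    remaining = [v for v in range(n_views) if v not in train]
    rng = random.Random(seed)
    rng.shuffle(remaining)
    tuning = sorted(remaining[:n_heldout])
    testing = sorted(remaining[n_heldout:2 * n_heldout])
    return train, tuning, testing


train8, tuning8, testing8 = coil_split(8)
train36, tuning36, testing36 = coil_split(36)
train_angles = [5 * v for v in train8]  # rotation angle in degrees
```

With 8 training examples the selected views are 45 degrees apart, and with 36 examples they are 10 degrees apart, which leaves exactly 36 views from which the two held-out sets of 9 are drawn.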

Figure 8 shows the mean average precision results obtained in this experiment (“all queries” plot). Unfortunately we were not able to show any retrieval examples due to space constraints. The “reduced query set” plot shows the same experiment, where queries are omitted for objects that are invariant under the turntable rotation performed during the COIL-100 dataset acquisition. As expected, the average precision scores are slightly lower, but the differences become negligible when there are many examples per object (for 36 examples, the “reduced query set” is actually about 0.5% better than “all queries”).

Figure 4: Retrieval result (ranks 1 to 3) for the 4-word query “sergeant wilper fort Cumberland” (one relevant line in collection).

Figure 5: Retrieval result for the 3-word query “men Virginia regiment” (one relevant line in collection).

Figure 7: COIL-100 dataset examples: original color images (a, b) and extracted shapes with our annotations, (c) “box” and (d) “car”.

These results are very encouraging, since they indicate we can perform satisfactory retrieval at around 80% mean average precision (m.a.p.) for 8 examples per object (45 degrees apart) and high performance retrieval at 97% m.a.p. for 36 examples per object (10 degrees apart). Note that this is done exclusively on shape images (without using any intensity information). Clearly, if other information and a more specialized feature set were used, even higher precision scores could be achieved.

Figure 8: Retrieval results on the COIL-100 dataset for different numbers of examples per object (x-axis: examples per object). The reduced query set excludes queries for objects that appear invariant under the rotation performed during the dataset acquisition.

6.  Summary  and  Conclusion 

We have presented a relevance-based language model for the retrieval of handwritten documents and general shapes. Our model estimates the joint probability of occurrence of annotation and feature vocabulary terms in order to perform probabilistic annotation and retrieval of handwritten words (documents) and general shapes. Our approach is the first to use shape-based features, and we presented appropriate shape representation, discretization and retrieval techniques. The results for the retrieval of lines of handwritten text indicate performance at a level that is practical for real-world applications.

Future work will include a retrieval system for a larger collection with page retrieval. Extending the collection could require more features in order to discriminate better between similar words. Lastly, we also plan to work on improved retrieval models.